A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
> library(AppliedPredictiveModeling)
> data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
# Load data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in those missing values (e.g., see Sect. 3.8).
We’ll first generate a summary to get a first look at the data and to show missing values, then we’ll get a count of missing values per variable.
# Show missing value counts
dfchem <- ChemicalManufacturingProcess # To avoid typing this many letters
data.frame(Missing=colSums(is.na(dfchem))) %>%
filter(Missing > 0) %>%
kbl(caption='Missing value counts') %>%
kable_classic(full_width=F)| Missing | |
|---|---|
| ManufacturingProcess01 | 1 |
| ManufacturingProcess02 | 3 |
| ManufacturingProcess03 | 15 |
| ManufacturingProcess04 | 1 |
| ManufacturingProcess05 | 1 |
| ManufacturingProcess06 | 2 |
| ManufacturingProcess07 | 1 |
| ManufacturingProcess08 | 1 |
| ManufacturingProcess10 | 9 |
| ManufacturingProcess11 | 10 |
| ManufacturingProcess12 | 1 |
| ManufacturingProcess14 | 1 |
| ManufacturingProcess22 | 1 |
| ManufacturingProcess23 | 1 |
| ManufacturingProcess24 | 1 |
| ManufacturingProcess25 | 5 |
| ManufacturingProcess26 | 5 |
| ManufacturingProcess27 | 5 |
| ManufacturingProcess28 | 5 |
| ManufacturingProcess29 | 5 |
| ManufacturingProcess30 | 5 |
| ManufacturingProcess31 | 5 |
| ManufacturingProcess33 | 5 |
| ManufacturingProcess34 | 5 |
| ManufacturingProcess35 | 5 |
| ManufacturingProcess36 | 5 |
| ManufacturingProcess40 | 1 |
| ManufacturingProcess41 | 1 |
# Use MICE to impute missing values
imp <- mice(dfchem, printFlag=F)
dfchem2 <- complete(imp)
# Verify no more missing values exist
print(paste0('Missing values after imputation: ', sum(is.na(dfchem2))))## [1] "Missing values after imputation: 0"
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
First, we’ll choose a random seed for reproducible results. Then we’ll split the data using createDataPartition(), which attempts to split the data in such a way that the training and test sets will an approximately proportionate distributions of the outcome variable.
# Set seed
set.seed(777)
# Split into train/test; createDataPartition generates indicies of the training set
train_indices <- createDataPartition(dfchem2$Yield, p=0.80, times=1, list=F)
dftrain <- dfchem2[train_indices,]
dftest <- dfchem2[-train_indices,]
# Examine train and test sets
hist(dftrain$Yield, main='Training set yield histogram')hist(dftest$Yield, main='Test set yield histogram')Now we’ll examine the predictors to look for correlations and examine their relationship to the outcome variable.
# Corr chart - not useful here because of the number of predictors
# chart.Correlation(dftrain)
# Corr plot
corr1 <- cor(dftrain)
corrplot(corr1, order='hclust', type='upper', diag=F)# Show candidates for removal
print("Candidates for removal due to high correlation:")## [1] "Candidates for removal due to high correlation:"
high_corr <- findCorrelation(corr1, cutoff=0.9, exact=T, verbose=F, names=F)
colnames(dftrain[,high_corr])## [1] "BiologicalMaterial02" "BiologicalMaterial04" "BiologicalMaterial12"
## [4] "ManufacturingProcess29" "ManufacturingProcess42" "ManufacturingProcess26"
## [7] "ManufacturingProcess25" "ManufacturingProcess31" "ManufacturingProcess18"
## [10] "ManufacturingProcess40"
# Remove the highly correlated variables
dftrain2 <- dftrain[,-high_corr]
dftest2 <- dftest[,-high_corr]Now we’ll try modeling. First, prepare a data frame for the results, separate the outcome and predictor variables, and set cross-validation parameters.
# Results data frame
dfr <- data.frame(matrix(nrow=10, ncol=3))
colnames(dfr) <- c('Model', 'Tuning.Parameters', 'RMSE')
# Separate outcome and predictors
trainx <- dftrain2 %>% dplyr::select(-Yield)
trainy <- dftrain2$Yield
testx <- dftest2 %>% dplyr::select(-Yield)
testy <- dftest2$Yield
# specify 10x cross-validation
ctrl <- trainControl(method='cv', number=10)Linear regression
# Linear model
set.seed(77)
fit <- train(x=trainx, y=trainy, method='lm', trControl=ctrl)
fit## Linear Regression
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.286066 0.5084493 1.369764
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
dfr[1,] = data.frame(MOdel='LM', Tuning.Parameters='', RMSE=fit$results[['RMSE']])
# Stepwise linear model using MASS package/caret
set.seed(77)
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0))) # Max number of parameters to use
fit <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid)## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
fit## Linear Regression with Stepwise Selection
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 1.624664 0.2226359 1.316949
## 2 1.375821 0.4152041 1.162999
## 3 1.296868 0.4858057 1.071156
## 4 1.298437 0.5019131 1.059803
## 5 1.340354 0.4771684 1.115468
## 6 1.451502 0.4046808 1.179163
## 7 1.350447 0.4618300 1.112193
## 8 1.307533 0.5043358 1.066602
## 9 2.407935 0.5033313 1.346444
## 10 2.271019 0.4967220 1.334601
## 11 2.224866 0.4985500 1.333145
## 12 2.252773 0.4621222 1.354180
## 13 2.376942 0.4026359 1.431658
## 14 1.853431 0.5066380 1.218464
## 15 1.282595 0.4986697 1.043252
## 16 1.846756 0.5030727 1.219705
## 17 1.345225 0.4570793 1.101332
## 18 1.320995 0.4917882 1.097843
## 19 1.280712 0.5234311 1.064212
## 20 1.292781 0.5164534 1.061669
## 21 1.323369 0.4876111 1.079384
## 22 1.274669 0.5304642 1.061286
## 23 1.273597 0.5313222 1.077300
## 24 1.339381 0.5144554 1.068906
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 23.
dfr[2,] = data.frame(MOdel='LM', Tuning.Parameters=paste0('nvmax=', fit$bestTune[['nvmax']]), RMSE=min(fit$results[['RMSE']]))Robust linear regression
# Robust linear model - this errors out with "'x' is singular: singular fits are not implemented in 'rlm'"
set.seed(77)
# fit <- train(x=trainx, y=trainy, method='rlm', preprocess=c('center', 'scale', 'pca'), trControl=ctrl)
## from Kayleahs
testData_minus_yield <- trainx
trans_test <- preProcess(testData_minus_yield, method = c("center", "scale", "YeoJohnson", "nzv"))
transformed_test <- predict(trans_test, testData_minus_yield)
fit <- train(transformed_test,
trainy,
method = "rlm", preProcess = "pca", Control = ctrl)
dfr[3,] = data.frame(Model='RLM', Tuning.Parameters='', RMSE=fit$results[['RMSE']]) # tempPartial least squares
# PLS, tuneLength=5
set.seed(77)
fit <- train(x=trainx, y=trainy, method='pls', preprocess=c('center', 'scale'), trControl=ctrl, tuneLength=5)
fit## Partial Least Squares
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.676215 0.2522705 1.383035
## 2 1.674497 0.2541740 1.386033
## 3 1.589468 0.2770528 1.313656
## 4 3.508917 0.2431983 1.852400
## 5 3.091332 0.2582261 1.701353
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
dfr[4,] = data.frame(Model='PLS', Tuning.Parameters='tuneLength=5', RMSE=fit$results[['RMSE']])
# PLS, tunLength=20
set.seed(77)
fit <- train(x=trainx, y=trainy, method='pls', preprocess=c('center', 'scale'), trControl=ctrl, tuneLength=20)
fit## Partial Least Squares
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.676215 0.2522705 1.3830349
## 2 1.674497 0.2541740 1.3860331
## 3 1.589468 0.2770528 1.3136558
## 4 3.508917 0.2431983 1.8523999
## 5 3.091332 0.2582261 1.7013526
## 6 3.606812 0.3286406 1.8396639
## 7 3.267749 0.3626882 1.6871394
## 8 2.849073 0.3859410 1.5535857
## 9 2.707362 0.3868188 1.4979158
## 10 2.554798 0.4163303 1.4226499
## 11 2.412650 0.4448560 1.3498822
## 12 2.902913 0.4779650 1.4479312
## 13 1.703657 0.5111518 1.1475910
## 14 1.303947 0.5818938 0.9918495
## 15 1.308174 0.6185297 0.9794667
## 16 1.256095 0.6403446 0.9573178
## 17 1.527881 0.6179692 1.0400251
## 18 1.764593 0.6134669 1.1002319
## 19 1.786071 0.6039765 1.1147107
## 20 1.868231 0.5853140 1.1452380
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.
dfr[5,] = data.frame(Model='PLS', Tuning.Parameters='tuneLength=20', RMSE=fit$results[['RMSE']])Ridge regression
# Ridge regression
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fit <- train(x=trainx, y=trainy, method='ridge', preprocess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
fit## Ridge Regression
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 2.286066 0.5084493 1.369764
## 0.007142857 1.828814 0.5243584 1.221142
## 0.014285714 1.672191 0.5359743 1.163832
## 0.021428571 1.582316 0.5444225 1.129514
## 0.028571429 1.521912 0.5509506 1.107115
## 0.035714286 1.477937 0.5562217 1.090175
## 0.042857143 1.444310 0.5606085 1.076822
## 0.050000000 1.417720 0.5643401 1.066649
## 0.057142857 1.396178 0.5675674 1.058259
## 0.064285714 1.378407 0.5703949 1.051740
## 0.071428571 1.363539 0.5728981 1.046324
## 0.078571429 1.350962 0.5751332 1.041552
## 0.085714286 1.340233 0.5771427 1.037317
## 0.092857143 1.331019 0.5789601 1.033655
## 0.100000000 1.323064 0.5806118 1.030542
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
dfr[6,] = data.frame(Model='Ridge', Tuning.Parameters=paste0('labmda=', fit$bestTune[['lambda']]), RMSE=fit$results[['RMSE']])Lasso regression
# Ridge regression
set.seed(77)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
fit <- train(x=trainx, y=trainy, method='enet', preprocess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
fit## Elasticnet
##
## 144 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 1.433201 0.5856073 1.1901593
## 0.00 0.10 1.209237 0.6032783 1.0185864
## 0.00 0.15 1.140548 0.6111652 0.9536177
## 0.00 0.20 1.124102 0.6187659 0.9322020
## 0.00 0.25 1.121581 0.6200730 0.9283358
## 0.00 0.30 1.118237 0.6243936 0.9188952
## 0.00 0.35 1.117536 0.6276211 0.9146605
## 0.00 0.40 1.120802 0.6289250 0.9137445
## 0.00 0.45 1.131374 0.6254270 0.9208721
## 0.00 0.50 1.143630 0.6203770 0.9322169
## 0.00 0.55 1.159585 0.6121583 0.9473379
## 0.00 0.60 1.240437 0.5865593 0.9936258
## 0.00 0.65 1.370591 0.5687979 1.0475752
## 0.00 0.70 1.496816 0.5555727 1.0982473
## 0.00 0.75 1.629331 0.5432079 1.1493911
## 0.00 0.80 1.783242 0.5313500 1.2046776
## 0.00 0.85 1.946187 0.5209446 1.2605442
## 0.00 0.90 2.044545 0.5151547 1.2951275
## 0.00 0.95 2.162856 0.5114413 1.3323160
## 0.00 1.00 2.286066 0.5084493 1.3697641
## 0.01 0.05 1.539145 0.5182418 1.2725262
## 0.01 0.10 1.330375 0.5904373 1.1077079
## 0.01 0.15 1.193655 0.6047618 1.0002440
## 0.01 0.20 1.150271 0.6122303 0.9614231
## 0.01 0.25 1.121654 0.6228151 0.9368904
## 0.01 0.30 1.116869 0.6235431 0.9303592
## 0.01 0.35 1.116861 0.6244385 0.9230302
## 0.01 0.40 1.118073 0.6259194 0.9172130
## 0.01 0.45 1.118784 0.6272024 0.9131786
## 0.01 0.50 1.119599 0.6283309 0.9133613
## 0.01 0.55 1.123014 0.6279402 0.9155330
## 0.01 0.60 1.129705 0.6250074 0.9217190
## 0.01 0.65 1.150129 0.6160037 0.9396739
## 0.01 0.70 1.209070 0.6000021 0.9714812
## 0.01 0.75 1.291145 0.5872609 1.0063842
## 0.01 0.80 1.381933 0.5769310 1.0417455
## 0.01 0.85 1.472187 0.5678927 1.0771760
## 0.01 0.90 1.565296 0.5562961 1.1170458
## 0.01 0.95 1.665349 0.5425771 1.1582725
## 0.01 1.00 1.753162 0.5294971 1.1940172
## 0.10 0.05 1.613736 0.4781463 1.3352216
## 0.10 0.10 1.478266 0.5518656 1.2214140
## 0.10 0.15 1.348772 0.5879584 1.1184758
## 0.10 0.20 1.245124 0.6021145 1.0419137
## 0.10 0.25 1.181267 0.6078709 0.9902300
## 0.10 0.30 1.158867 0.6100634 0.9686527
## 0.10 0.35 1.136495 0.6173264 0.9502379
## 0.10 0.40 1.124044 0.6224835 0.9407894
## 0.10 0.45 1.118174 0.6270984 0.9323668
## 0.10 0.50 1.117726 0.6285291 0.9298664
## 0.10 0.55 1.120394 0.6288807 0.9248968
## 0.10 0.60 1.127845 0.6274251 0.9235546
## 0.10 0.65 1.134053 0.6253379 0.9210281
## 0.10 0.70 1.139854 0.6233843 0.9215161
## 0.10 0.75 1.157962 0.6180590 0.9328568
## 0.10 0.80 1.181786 0.6121834 0.9471762
## 0.10 0.85 1.209329 0.6065917 0.9614510
## 0.10 0.90 1.238325 0.6011691 0.9792009
## 0.10 0.95 1.277395 0.5916046 1.0042866
## 0.10 1.00 1.323064 0.5806118 1.0305422
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.01.
dfr[7,] = data.frame(
Model='Lasso (enlastic net)',
Tuning.Parameters=paste0('lambda=', fit$bestTune[['lambda']], ', fraction=', fit$bestTune[['fraction']]),
RMSE=fit$results[['RMSE']]
)# Model comparison
dfr %>%
kbl(caption='Model comparison') %>%
kable_classic(full_width=F)| Model | Tuning.Parameters | RMSE |
|---|---|---|
| LM | 2.286066 | |
| LM | nvmax=23 | 1.273597 |
| RLM | 40.127835 | |
| PLS | tuneLength=5 | 1.676215 |
| PLS | tuneLength=20 | 1.676215 |
| Ridge | labmda=0.1 | 2.286066 |
| Lasso (enlastic net) | lambda=0.01, fraction=0.35 | 1.433201 |
| NA | NA | NA |
| NA | NA | NA |
| NA | NA | NA |
The stepwise linear model with max parameters=9 was the best-performing model under training conditions, with an RMSE of 1.255.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
First, calculate the accuracy based on the test test.
# Rerun the model - stepwise linear model using MASS package/caret
set.seed(77)
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0))) # Max number of parameters to use
fit <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid)## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
# Generate y predictions
predy <- predict(fit, newdata=testx)
# Get accuracy results
accuracy(predy, testy)## ME RMSE MAE MPE MAPE
## Test set -0.8679355 5.979448 2.014883 -2.224604 4.945104
As expected, the RMSE for the test set is higher than that of the training set (1.335979 vs 1.225019, respectively). However, the test RMSE is only slightly higher, and the low MAPE value of 2.53 indicates good model performance.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# Extract coefficients from model
dftop <- data.frame(Coefficient=coef(fit$finalModel, fit$bestTune[['nvmax']] + 1))
dftop %>%
kbl(caption='Model coefficients') %>%
kable_classic(full_width=F)| Coefficient | |
|---|---|
| (Intercept) | 75.6781250 |
| BiologicalMaterial01 | -0.1273122 |
| BiologicalMaterial05 | 0.1797094 |
| BiologicalMaterial07 | -1.7364640 |
| BiologicalMaterial09 | 0.3004392 |
| BiologicalMaterial11 | -0.0097183 |
| ManufacturingProcess02 | -0.0245667 |
| ManufacturingProcess03 | -7.7851401 |
| ManufacturingProcess04 | 0.0652808 |
| ManufacturingProcess09 | 0.3708992 |
| ManufacturingProcess12 | 0.0001340 |
| ManufacturingProcess13 | -0.4367373 |
| ManufacturingProcess15 | 0.0089976 |
| ManufacturingProcess16 | -0.0071973 |
| ManufacturingProcess19 | 0.0057596 |
| ManufacturingProcess20 | 0.0004919 |
| ManufacturingProcess30 | 0.0413937 |
| ManufacturingProcess33 | 0.1585066 |
| ManufacturingProcess34 | 7.9789257 |
| ManufacturingProcess36 | -374.0649304 |
| ManufacturingProcess37 | -0.4717292 |
| ManufacturingProcess38 | 0.0270597 |
| ManufacturingProcess39 | 0.1039607 |
| ManufacturingProcess44 | 0.6414816 |
| ManufacturingProcess21 | -0.1454709 |
As shown, the five most important variables are all those related to the manufacturing process. This works out favorably, as these are elements that can be controlled or adjusted to have an effect on yield, whereas the biological predictors can’t be changed.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Iterate over top parameters
for (p in rownames(dftop)) {
if (p != '(Intercept)') { # do not include the intercept
plt <- dfchem2 %>%
ggplot()
if (p == 'ManufacturingProcess01' | p == 'ManufacturingProcess09') { # these are quantitative variables
plt <- plt +
geom_point(aes(x=eval(sym(p)), y=Yield))
} else { # these are categorical variables
plt <- plt +
geom_bar(aes(x=eval(sym(p)), y=Yield), stat='identity')
}
plt <- plt +
xlab(p)
print(plt)
}
}